Basic Project Introduction

For the midterm project of MUSA 507, our team builds a regression model to study house prices in the city of Boston. In this R Markdown, we explore the question "Which factors most affect house prices in Boston, and to what extent?" We use the house sale price in Boston as the dependent variable and a set of variables we believe influence the price as predictors. By fitting a regression model with strong predictive power, we can identify the most critical (statistically significant) relationships and apply the model to any study area in Boston. One thing to notice is that we are not trying to predict future prices, which depend on many uncertain factors; rather, we use the model to summarize what we learn from areas where we already have data and generalize it to areas where we do not.

Brief introduction to the method of data gathering

To find data that work well for the regression model, we apply several criteria when selecting datasets:

- The spatial data need an independent coordinate system or map geometry; spatial data aggregated to tracts are not acceptable.
- The data should be recently updated.
- The sample size should be large enough.

To ensure the reliability of the datasets, our data come from the websites "Boston Open Data," "American Facts," and "Mapzen." We processed part of the datasets in ArcMap to keep the R code clean and brief.

Brief introduction to the method

To build a prediction model that generalizes well to other study areas, we carefully follow the steps below. In the regression model, house price is the dependent variable and the other variables serve as predictors.

a. Data wrangling and cleaning. We processed five predictors found from open sources in both R and ArcMap so that they share the fields needed for later processing; unrelated fields were deleted to keep the data clean.
b. With the initial datasets ready, we run summary statistics and use Moran's I to understand the significance level of the model we build.
c. Once Moran's I gives a reasonably good p-value (that is, the model is usable), we run the in-sample and out-of-sample predictions, each introduced in detail in its own section.
d. After the 'training' and 'test' evaluation in step c, we apply one more method, called "cross-validation," to better understand how well the model generalizes.
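Step d, cross-validation, can be sketched with `caret` as below. This is a minimal illustration under stated assumptions: the predictor subset shown is arbitrary, and `try1` stands for the cleaned data frame built in the sections that follow; it is not the exact model specification we fit later.

```r
# minimal 10-fold cross-validation sketch with caret (illustrative only;
# assumes `try1` is the cleaned data frame assembled in the sections below)
library(caret)

fitControl <- trainControl(method = "cv", number = 10)

set.seed(825)
cvFit <- train(log(SalePrice) ~ LAND_SF + GROSS_AREA + d_education + d_trans,
               data = try1,
               method = "lm",
               trControl = fitControl,
               na.action = na.omit)

cvFit$resample  # per-fold RMSE and R-squared
```

If the per-fold error metrics are similar across folds, that is evidence the model generalizes consistently rather than fitting one part of the study area well and another poorly.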

1.1. Data wrangling- setup

knitr::opts_chunk$set(echo = TRUE)
library(corrplot)
library(caret) 
library(dplyr)
library(AppliedPredictiveModeling)
library(stargazer)
library(ggmap)
library(tidyverse)
library(sf)
library(FNN)

1.1.1. Set the map theme

Data Setup - Data Wrangling. Set up the map theme used for all the maps below.

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

1.1.2. Add new variables (using R to calculate nearest-neighbor distances)

1.1.2.1 The distance to education

Read the data into R. We begin the data setup with two datasets: Boston house information and education points (surrounding schools).

biz <- st_read("Boston_Midterms_Dataset.shp") 
bostonSalesnew <- st_read("Boston_Midterms_Dataset.shp") 
educationfinal <- st_read("education_final.shp")

First, we need to wrangle a data frame containing the data required for the regression calculation. We start with education as a predictor for the dependent variable "home price".

st_crs(bostonSalesnew) == st_crs(educationfinal)

homeprice <- st_read("Boston_Midterms_Dataset.shp") %>%
  dplyr::select(geometry) %>%
  st_transform(4326) %>%
  mutate(landUse = "homeprice")

education <- st_read("education_final.shp") %>%
  dplyr::select(geometry)%>%
  st_transform(4326) %>%
  mutate(landUse = "education")
allPlaces <- rbind(homeprice, education)
allPlaces <- cbind(as.data.frame(st_coordinates(allPlaces)), data.frame(allPlaces))
pghBaseMap <- get_map(location = "boston, Massachusetts", 
                      source = "stamen", 
                      zoom = 11, 
                      maptype= 'toner')
ggmap(pghBaseMap) + 
  geom_point(data = allPlaces, 
             aes(x=X, y=Y, color=landUse), 
             size = 0.5) + 
  labs(title="homeprice & education around boston") +
  scale_colour_manual(values = c("darkgreen","red"),
                      name = "Land Use",
                      labels=c("homeprice","education")) +
  mapTheme()

In this step we compute the average nearest-neighbor distance: how far each house point in the model is from its 5 nearest education points (schools). The first two statements create a matrix of coordinates for each land-use type. The output "nn" contains two elements: "nn.index", which lists for each house its 5 nearest education neighbors, and "nn.dist", which lists the distances from each house point to those 5 nearest education points.

homepriceXY <-
  allPlaces %>%
  filter(landUse == "homeprice") %>%
  dplyr::select(X,Y) %>%
  as.matrix()   

educationXY <-
  allPlaces %>%
  filter(landUse == "education") %>%
  dplyr::select(X,Y) %>%
  as.matrix() 
nn = get.knnx(educationXY,homepriceXY,k=5)
names(nn)

This step takes the average distance from each house point to its nearest five neighbors. We first create a new data frame called biz2: we reshape the distance matrix into long format with one row per house-school pair, then calculate the mean distance from each house point to its nearest 5 school points. Finally we bind the list of mean distances to biz to form biz2. Below is the Nearest Neighbor of Education - Quantile Break Map for the first independent variable - education (school location).

biz2 <-
  as.data.frame(nn$nn.dist) %>%
  rownames_to_column(var = "home_prices") %>%
  gather(education, education_Distance, V1:V5) %>%
  arrange(as.numeric(home_prices)) %>%
  group_by(home_prices) %>%
  summarize(d_education = mean(education_Distance)) %>%
  arrange(as.numeric(home_prices)) %>% 
  dplyr::select(-home_prices) %>%
  bind_cols(biz)
  
#visualize the distance

ggmap(pghBaseMap) + 
  geom_point(data = biz2, 
             aes(x=Longitude, y=Latitude, color=factor(ntile(d_education,5))), 
             size = 2) + 
  geom_point(data = subset(allPlaces, landUse == "education"), 
             aes(x=X, y=Y), colour = "red") + 
  labs(title="Distance from homeprice to 5 nearest education",
       subtitle = "education in red") +
  scale_colour_manual(values = c("#ffffcc","#a1dab4","#41b6c4","#2c7fb8","#253494"),
                      labels=as.character(quantile(biz2$d_education,
                                                   c(.1,.2,.4,.6,.8),na.rm=T)),
                      name="distance\n(in thousands)\n (Quintile Breaks)") +
  mapTheme()

1.1.2.2. Add the distance to transportation points

In the next step, we repeat the same data-wrangling process used for the education dataset, substituting transportation points (bus stations) for education points and calculating the average distance from each house point to its nearest five transportation points. Summarizing the data gives the list of mean distances, which we add to the new data frame biz3. The resulting map "HomePrice & Transportation point around Boston" shows the house points together with all the transportation points in our dataset.

trans <- st_read("boston_massachusetts_osm_transport_points.shp")
st_crs(trans) == st_crs(bostonSalesnew)
homeprice <- st_read("Boston_Midterms_Dataset.shp") %>%
  dplyr::select(geometry) %>%
  st_transform(4326) %>%
  mutate(landUse = "homeprice")

trans <- st_read("boston_massachusetts_osm_transport_points.shp") %>%
  dplyr::select(geometry) %>%
  st_transform(4326) %>%
  mutate(landUse = "transportation")

allPlaces <- rbind(homeprice, trans)
allPlaces <- cbind(as.data.frame(st_coordinates(allPlaces)), data.frame(allPlaces))
pghBaseMap <- get_map(location = "boston, Massachusetts", 
                      source = "stamen", 
                      zoom = 11, 
                      maptype= 'toner')
ggmap(pghBaseMap) + 
  geom_point(data = allPlaces, 
             aes(x=X, y=Y, color=landUse), 
             size = 0.1) + 
  labs(title="homeprice & transportation point around boston") +
  scale_colour_manual(values = c("darkgreen","red"),
                      name = "Land Use",
                      labels=c("homeprice","transportation")) +
  mapTheme()

homepriceXY <-
  allPlaces %>%
  filter(landUse == "homeprice") %>%
  dplyr::select(X,Y) %>%
  as.matrix()   

transXY <-
  allPlaces %>%
  filter(landUse == "transportation") %>%
  dplyr::select(X,Y) %>%
  as.matrix() 

nn = get.knnx(transXY,homepriceXY,k=5)

names(nn)

biz3 <-
  as.data.frame(nn$nn.dist) %>%
  rownames_to_column(var = "home_prices") %>%
  gather(trans, trans_Distance, V1:V5) %>%
  arrange(as.numeric(home_prices)) %>%
  group_by(home_prices) %>%
  summarize(d_trans = mean(trans_Distance)) %>%
  arrange(as.numeric(home_prices)) %>% 
  dplyr::select(-home_prices) %>%
  bind_cols(biz2)

The final map for the transportation predictor ranks each house point by its mean distance to its five nearest neighbors using quantile breaks; darker colors mean greater distance. Below is the Nearest Neighbor of Transportation - Quantile Break Map for the second independent variable - transportation (bus stations).

ggmap(pghBaseMap) + 
  geom_point(data = biz3, 
             aes(x=Longitude, y=Latitude, color=factor(ntile(d_trans,5))), 
             size = 1)  + 
  labs(title="Distance from homeprice to 5 nearest transportation point",
       subtitle = "transportation point in red") +
  scale_colour_manual(values = c("#ffffcc","#a1dab4","#41b6c4","#2c7fb8","#253494"),
                      labels=as.character(quantile(biz3$d_trans,
                                                   c(.1,.2,.4,.6,.8),na.rm=T)),
                      name="distance\n(in thousands)\n (Quintile Breaks)") +
  mapTheme()

1.1.4. Use GIS density processing to derive three more variables and add them to the combined data

1.1.4.1. Crime density visualization

Calculate the crime density for each house point. The data were processed in ArcMap. Darker colors represent more crime at the point location.

CRIME<- st_read("crime.shp")

ggmap(pghBaseMap) + 
  geom_point(data = CRIME, 
             aes(x=Longitude, y=Latitude, color=factor(ntile(CRIME$RASTERVALU,5))), 
             size = 1)  + 
  labs(title="Crime density in Boston") +
  scale_colour_manual(values = c("#ffffcc","#a1dab4","#41b6c4","#2c7fb8","#253494"),
                      labels=as.character(quantile(CRIME$RASTERVALU,
                                                   c(.1,.2,.4,.6,.8),na.rm=T)),
                      name="crime density\n(in thousands)\n (Quintile Breaks)") +
  mapTheme()

1.1.4.2. Road density

Calculate the relative road density for each house point. The data were processed in ArcMap. Darker colors represent denser roads.

road_dens<- st_read("road_densfinal.shp")

ggmap(pghBaseMap) + 
  geom_point(data = road_dens, 
             aes(x=Longitude, y=Latitude, color=factor(ntile(road_dens$RASTERVALU,5))), 
             size = 1)  + 
  labs(title="road density in Boston") +
  scale_colour_manual(values = c("#ffffcc","#a1dab4","#41b6c4","#2c7fb8","#253494"),
                      labels=as.character(quantile(road_dens$RASTERVALU,
                                                   c(.1,.2,.4,.6,.8),na.rm=T)),
                      name="road density\n(in thousands)\n (Quintile Breaks)") +
  mapTheme()

1.1.5 add the spatial dimension data

Using the commands st_crs and st_transform, we unify the spatial reference of the datasets in use. By joining the data, we obtain a new dataset called "bostonSales_joinNhoods" containing both the house sales information and the census-tract information for Boston. By extracting and combining useful information from multiple datasets, we then get a new dataset called "try1", which has all the factors we need for the regression calculation, including three predictors we use to predict the house price.

bostonSales <- st_read("Boston_Midterms_Dataset.shp") 
bostonNhoods <- st_read("Boston_Neighborhoods.shp")
bostontract<- st_read("Census2010_Tracts.shp")

st_crs(bostonSales) == st_crs(bostonNhoods)
bostontract <- st_transform(bostontract, st_crs(bostonSales))
st_crs(bostonSales) == st_crs(bostontract)
bostonSales_joinNhoods <- st_join(bostonSales, bostonNhoods)
bostonSales_joinNhoods <- st_join(bostonSales_joinNhoods, bostontract)


finalregressiondata <- read_csv("finalregressiondata_csv.csv")
## Warning: Missing column names filled in: 'X33' [33]
locationfactor<-bostonSales_joinNhoods[,c(7,9,12,15,17:19,22:41,60:63,65,73)]
newdatafinal<-bind_cols(locationfactor,finalregressiondata)
try1<-newdatafinal[,c(1:34,36,46,68:73)]
names(try1)
unlist(lapply(try1, class))

1.2. Data Wrangling - Exploratory Analysis

Now that the data are cleaned and all the variables we need are in "try1", we can start to run the regression model for the house price prediction.

baseMap <- get_map(location = "boston", 
                   source = "stamen", 
                   zoom = 11, 
                   maptype= 'toner')
ggmap(baseMap) + 
  geom_point(data = try1, 
             aes(x=Longitude, y=Latitude, color=factor(ntile(SalePrice,5))), 
             size = 1) + 
  labs(title="saleprice, boston") +
  scale_colour_manual(values = c("#ffffcc","#a1dab4","#41b6c4","#2c7fb8","#253494"),
                      labels=as.character(quantile(try1$SalePrice,
                                                   c(.1,.2,.4,.6,.8),na.rm=T)),
                      name="saleprice\nin thousands\n (Quintile Breaks)") +
  mapTheme()

After the final cleaning steps, we create one dataset called "numericdata", which adds the sales information for each house point. The command "stargazer" produces a summary-statistics table for each factor contained in the newest dataset "numericdata". We then create the correlation matrix to better display the correlation between variables.

unlist(lapply(try1, class))
numericdata<-try1 [,c(10,13,14,15,20:24,35,36:38,40,41)]
#delete the  geometry
numericdata <-
  numericdata %>%
  as.data.frame() %>%
  select(-geometry) %>%
  na.omit()
#add the saleprice on the numericdata
saleprice<-try1 [,c(34)]
saleprice <-
  saleprice %>%
  as.data.frame() %>%
  select(-geometry) %>%
  na.omit()
numericdata<-bind_cols(saleprice,numericdata)
unlist(lapply(numericdata, class))


#statistic summary
stargazer(numericdata, type="text", title = "Summary Statistics")
# plot the correlation map
M <- cor(numericdata)
bizCor <- numericdata
unlist(lapply(bizCor, class))
M <- cor(bizCor)
corrplot(M, method = "number")

2.1. Modeling - In-sample prediction

In this section, we start to build models. We use all the data to build the model, a process called "in-sample prediction". The result of the model allows us to see whether any variable is insignificant (not useful for the regression); we keep only the variables that are statistically significant.

Description of variables

Dependent variable:
- SalePrice (sale price for each house point in the sales dataset)

Independent variables:
- highway_dis (mean distance from each house point to its five nearest highways)
- road_dens (relative road density at each house point)
- crime_dens (relative crime density at each house point)
- d_trans (mean distance from each house point to its five nearest transportation points)
- d_education (mean distance from each house point to its five nearest education points (schools))
- ZIPCODE (zip code of each house point's location)
- PTYPE (property type)
- OWN_OCC (whether the house is owner-occupied)
- LAND_SF (land square footage for each house point)
- YR_REMOD1 (year the house was remodeled)
- R_EXT_FIN
- GROSS_AREA (gross area for each house point)
- R_TOTAL_RM (total number of rooms for each house point)
- R_BDRMS (total number of bedrooms for each house point)
- R_FULL_BTH (total number of full bathrooms for each house point)
- R_HALF_BTH (total number of half bathrooms for each house point)
- R_KITCH (total number of kitchens for each house point)

try1 <-
  try1 %>%
  mutate(ZIPCODE = as.factor(ZIPCODE),
         PTYPE = as.factor(PTYPE),
         R_FPLACE= as.factor(R_FPLACE),
         YR_REMOD = as.numeric(YR_REMOD)) %>%
  as.data.frame() %>%
  filter(test == 0)%>%
na.omit()
#run the regression
reg <- lm(log(SalePrice) ~  ZIPCODE+PTYPE+OWN_OCC+LAND_SF+YR_REMOD1+GROSS_AREA
           +LIVING_ARE+NUM_FLOORS+R_ROOF_TYP+R_EXT_FIN+
             R_TOTAL_RM+R_BDRMS+R_FULL_BTH+R_HALF_BTH+R_KITCH+
             highway_dis+road_dens+crime_dens+d_trans+d_education+Name, 
           data = try1) 

summary(reg)

The multiple R-squared is 0.8495, a fairly good result, but R-squared is only one measure of "goodness". In this step we use another way to test the model: a good prediction model should produce predicted values that agree closely with the observed values collected from the real world. In the graph below, the blue line is produced from the real-world data and the black points are our predicted values.

reg$fitted.values<-exp(reg$fitted.values)
regDF <- cbind(try1$SalePrice, reg$fitted.values)
colnames(regDF) <- c("Observed", "Predicted")
regDF <- as.data.frame(regDF)
ggplot() + 
  geom_point(data=regDF, aes(Observed, Predicted)) +
  stat_smooth(data=regDF, aes(Observed, Observed), method = "lm", se = FALSE, size = 1) + 
  labs(title="Predicted Sales Volume as a function\nof Observed Sales Volume") +
  theme(plot.title = element_text(size = 18,colour = "black"))

Make the histogram of the regression residuals.

hist(abs(try1$SalePrice - reg$fitted.values), breaks=50, main="Histogram of residuals (absolute values)")

Exploring the residuals, we create a residual map to see the difference between our predicted values and the real-world observed values. In the map displayed below, the residuals range from -0.5 to 0.5; where the residual is 0, our predicted value matches the real-world observed value most closely.

#residual map
reg_residuals <- data.frame(reg$residuals)
bosLonLat <- data.frame(try1$Longitude, try1$Latitude)
residualsToMap <- cbind(bosLonLat, reg_residuals )
colnames(residualsToMap) <- c("longitude", "latitude", "residual")

invert <- function(x) rgb(t(255-col2rgb(x))/255)    
baseMap_invert <- as.raster(apply(baseMap, 2, invert))
class(baseMap_invert) <- class(baseMap)
attr(baseMap_invert, "bb") <- attr(baseMap, "bb")
ggmap(baseMap_invert)+
  mapTheme()

ggmap(baseMap_invert) + 
  geom_point(data = residualsToMap, 
             aes(x=longitude, y=latitude, color=residual), 
             size = 0.1) + 
  labs(title="Residual of reg, Boston") +
  scale_color_gradient(low="green", high="pink")

In the last part of this section, we apply Moran's I to the residuals calculated above in order to test them. The result "p-value < 2.2e-16" shows that the Moran's I statistic is statistically significant, meaning the residuals still carry some spatial pattern; we keep this in mind as we move on to using the model for predictions.

#moran's I
library(spdep)
#moran's i
coords <- cbind(residualsToMap$longitude, residualsToMap$latitude)
spatialWeights <- knn2nb(knearneigh(coords, 4))
moran.test(reg$residuals, nb2listw(spatialWeights, style="W"))

2.2 Modeling - Out of Sample Prediction

Because the goal of this regression model is to predict house prices for areas where we do not have data, we simulate the out-of-sample usefulness of the model by separating the house price data into test and training sets. We use the "training" set to train the model and the "test" set to validate it. If the model can predict well on the "test" set, we can have confidence in the model's power to generalize. In the data-training code below, we take a 75% random sample: the model randomly selects 75% of the house points from the sample dataset and runs the regression on the selected points, which effectively tests the prediction power of the model across a variety of study areas.

inTrain <- createDataPartition(
  y = try1$SalePrice, 
  p = .75, list = FALSE)
training <- try1[ inTrain,] #the new training set
test <- try1[-inTrain,]  #the new test set
nrow(training) / nrow(try1)
reg2  <- lm(log(SalePrice) ~  ZIPCODE+PTYPE+OWN_OCC+LAND_SF+YR_REMOD1+GROSS_AREA
                     +LIVING_ARE+NUM_FLOORS+R_ROOF_TYP+R_EXT_FIN+
                       R_TOTAL_RM+R_BDRMS+R_FULL_BTH+R_HALF_BTH+R_KITCH+
                       highway_dis+road_dens+crime_dens+d_trans+d_education+Name, 
                     data = training) #fit on the training set only
reg2Pred <- predict(reg2, test) 
## Warning in predict.lm(reg2, test): prediction from a rank-deficient fit may
## be misleading
#plot
reg2Pred<-exp(reg2Pred)
regDF2 <- cbind(test$SalePrice, reg2Pred)
colnames(regDF2) <- c("Observed", "Predicted")
regDF2 <- as.data.frame(regDF2)
ggplot() + 
  geom_point(data=regDF2, aes(Observed, Predicted)) +
  stat_smooth(data=regDF2, aes(Observed, Observed), method = "lm", se = FALSE, size = 1) + 
  labs(title="Predicted Sales Volume as a function\nof Observed Sales Volume") +
  theme(plot.title = element_text(size = 18,colour = "black"))

After the data training, we have a preliminary sense of how well our model generalizes. To understand the model's level of generalization better, we now compute the MAPE (mean absolute percentage error), MAPE = mean(|observed − predicted| / observed), an effective way to assess the accuracy of the model. In the code below, to improve the goodness-of-fit metrics, we create a new data frame containing three new variables:

- error - the residual (the difference between the observed and predicted value). A nice property: this fit indicator is in the same units as the original data.
- absError - the absolute value of the error (residual), used to check the total amount of residual for the model; more residual means a less precise prediction.
- percentAbsError - the absolute error in percentage form, a more direct way to understand the residuals of the sale-price regression model.

#new fit metrics
reg2PredValues <- 
  data.frame(observedSales = test$SalePrice,
             predictedSales = reg2Pred)

reg2PredValues <-
  reg2PredValues %>%
  mutate(error = predictedSales - observedSales) %>%
  mutate(absError = abs(predictedSales - observedSales)) %>%
  mutate(percentAbsError = abs(predictedSales - observedSales) / observedSales) 

head(reg2PredValues)
##   observedSales predictedSales     error  absError percentAbsError
## 1        355000       401769.7  46769.72  46769.72      0.13174570
## 2        363000       327304.5 -35695.52  35695.52      0.09833476
## 3        300000       358968.0  58968.00  58968.00      0.19656001
## 4        240000       349104.7 109104.66 109104.66      0.45460274
## 5        607000       563336.5 -43663.50  43663.50      0.07193329
## 6        412000       429167.0  17166.98  17166.98      0.04166743
mean(reg2PredValues$absError)
## [1] 88449.58
mean(reg2PredValues$percentAbsError)
## [1] 0.1346596

2.2.3. Residuals as a function of observed sales and of predicted sales (25% test set)

# residuals as a function of observed sales, 25% test set
ggplot() + 
  geom_point(data=reg2PredValues, aes(observedSales, error)) +
  labs(title="Residual as a function\nof Observed Sales") +
  theme(plot.title = element_text(size = 18, colour = "black"))

# residuals as a function of predicted sales, 25% test set
ggplot() + 
  geom_point(data=reg2PredValues, aes(predictedSales, error)) +
  labs(title="Residual as a function\nof Predicted Sales") +
  theme(plot.title = element_text(size = 18, colour = "black"))

MAPE (mean absolute percentage error) by neighborhood. In this section, we add a few more predictors (year built, number of floors, and year remodeled) and compute the MAPE by neighborhood, which helps us understand more deeply the prediction accuracy of our regression model.
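Before mapping the metric by neighborhood, it may help to state it plainly: MAPE averages the absolute prediction error as a share of each observed value. A minimal base-R sketch (the `mape` helper and the toy numbers are ours for illustration, not part of the project script):

```r
# Mean Absolute Percentage Error: mean of |predicted - observed| / observed
mape <- function(observed, predicted) {
  mean(abs(predicted - observed) / observed, na.rm = TRUE)
}

# toy example: errors of 10% on each of two sales
mape(observed = c(100, 200), predicted = c(110, 180))   # 0.1
```

The same quantity is computed inline below as `mean(abs(exp(reg2$fitted.values) - vars$SalePrice) / vars$SalePrice)`.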

library(tidyverse)
library(sf)

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

bostonSales <- st_read("Boston_Midterms_Dataset.shp") 
bostonNhoods <- st_read("Boston_neighborhoods.shp")
st_crs(bostonSales) == st_crs(bostonNhoods)                      # do the two layers share a CRS?
bostonNhoods <- st_transform(bostonNhoods, st_crs(bostonSales))  # reproject neighborhoods to match
bostonSales_joinNhoods <- st_join(bostonSales, bostonNhoods)
vars <- bostonSales_joinNhoods[,c(6,9,18,25:40,61:63,65)]
library(dplyr)
vars <-
  vars %>%
  mutate(YR_BUILT = as.factor(YR_BUILT),
         NUM_FLOORS = as.factor(NUM_FLOORS),
         YR_REMOD = as.factor(YR_REMOD)) %>%
  as.data.frame() %>%
  select(-geometry) %>%
  filter(test == 0) %>%
  na.omit()

reg2Residuals <- 
  data.frame(residuals = exp(reg2$residuals),
             SalePrice = vars %>% select(SalePrice),
             Longitude = vars %>% select(Longitude),
             Latitude = vars %>% select(Latitude),
             Name = vars %>% select(Name),
             Legend = "reg2")

reg2Residuals_Summary <-
  reg2Residuals %>%
  group_by(Name) %>%
  summarize(meanResidual = mean(residuals, na.rm=TRUE),
            sdResidual = sd(residuals, na.rm=TRUE),
            meanAPE = mean(abs(exp(reg2$fitted.values) - vars$SalePrice) / vars$SalePrice),
            countSales = n()) %>%
  mutate(Legend = "reg2") %>%
  filter(countSales > 5) %>%
  left_join(bostonNhoods) %>%
  st_sf()

# overall MAPE across all sales
mean(abs(exp(reg2$fitted.values) - vars$SalePrice) / vars$SalePrice)


# map sale-point residuals (quintile breaks) over the neighborhood boundaries
ggplot() + 
  geom_sf(data=bostonNhoods, aes(), fill=NA, colour="black", size=1) +
  geom_point(data=reg2Residuals, 
             aes(Longitude, Latitude, color=factor(ntile(residuals,5))), size=1) +
  scale_colour_manual(values = c("#edf8fb","#b3cde3","#8c96c6","#8856a7","#810f7c"),
                      labels=as.character(quantile(reg2Residuals_Summary$meanAPE,
                                                   c(.1,.2,.4,.6,.8), na.rm=T)),
                      name="Residuals \n(Quintile Breaks)") +
  mapTheme()

2.3. Model cross-validation

To further test how well the model generalizes, we now apply cross-validation. In this process, we separate the sample data into 10 groups. In each round, we hold out one group as the "test set" and use the other nine groups to train the model, recording the goodness-of-fit metrics and repeating the process until every group has served as the test set.
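The hold-out scheme described above can be sketched in base R (toy fold assignment for illustration only; `caret` handles this bookkeeping internally):

```r
set.seed(825)
n <- 100                                     # toy sample size
folds <- sample(rep(1:10, length.out = n))   # randomly assign each row to one of 10 folds

for (k in 1:10) {
  test_idx  <- which(folds == k)             # fold k is held out as the test set
  train_idx <- which(folds != k)             # the other nine folds train the model
  # fit the model on train_idx, record goodness-of-fit metrics on test_idx ...
}

table(folds)                                 # each fold holds n/10 = 10 observations
```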

library(caret)   # provides trainControl() and train()

fitControl <- trainControl(method = "cv", number = 10)

set.seed(825)

lmFit <- train(log(SalePrice) ~  ZIPCODE+PTYPE+OWN_OCC+LAND_SF+YR_REMOD1+GROSS_AREA
               +LIVING_ARE+NUM_FLOORS+R_ROOF_TYP+R_EXT_FIN+
                 R_TOTAL_RM+R_BDRMS+R_FULL_BTH+R_HALF_BTH+R_KITCH+
                 highway_dis+road_dens+crime_dens+d_trans+d_education+Name, 
               data = try1,
               method = "lm", 
               trControl = fitControl)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
lmFit                    # cross-validated model summary
lmFit$resample           # goodness-of-fit metrics for each of the 10 folds
sd(lmFit$resample[,3])   # spread of the per-fold MAE

# distribution of mean absolute error across the 10 folds
ggplot(as.data.frame(lmFit$resample), aes(MAE)) + 
  geom_histogram(bins=4) +
  labs(x="Mean Absolute Error",
       y="Count")

# distribution of R-squared across the 10 folds
ggplot(as.data.frame(lmFit$resample), aes(Rsquared)) + 
  geom_histogram(bins=4) +
  labs(x="Rsquared",
       y="Count")

# 4. Predict sale prices for the held-out test set
trynew2 <- newdatafinal[,c(1:34,35,36,46,68:73)]
trynew <- try
testdata <- trynew2
testdata <-
  testdata %>%
  mutate(ZIPCODE = as.factor(ZIPCODE),
         PTYPE = as.factor(PTYPE)) %>%
  as.data.frame() %>%
  filter(test == 1)


summary(reg)

prefinal1 <- predict(reg, testdata) 
## Warning in predict.lm(reg, testdata): prediction from a rank-deficient fit
## may be misleading
prefinal1 <- exp(prefinal1)   # back-transform predictions from log(SalePrice)

regPredValues <- 
  data.frame(UniqueSale = testdata$UniqueSale,
             predictedSales = prefinal1)
write.csv(regPredValues, 'finalpredictionyep.csv')

Discussion and Final Conclusion

In this assignment, our group built a regression model using observed data gathered from open data sources on the Internet. The objective of this project is to use the data we have to build a generalizable model with strong power to predict house prices in study areas for which we lack data. To improve the precision of the model's predictions and to make the model more generalizable, we applied several methods to the regression model we built. First, the Moran's I test on the residuals of the OLS regression returned a sufficiently small p-value, so the model proved usable and we could continue working with it. Then, we performed in-sample and out-of-sample training to make sure that the model generalizes across different types of study areas. Finally, we applied cross-validation to further test the generalizability of the regression model.

By applying multiple methods to the data and model, we obtained a variety of regression results, including decision-making statistics such as the p-value, R-squared, and adjusted R-squared. Analyzing these statistics gives us confidence that the model we built is generalizable enough to be applied to other study areas for house value prediction. However, because this model is built on the most recently updated open-source data, it has time limitations for future use: users should keep the data updated to obtain the most precise predictions, and the initial data must also be precise enough to support a prediction model that generalizes with strong power.